Text book: Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences , Second Edition. Chapman and Hall/CRC, Boca Raton, Florida, USA.
My GitHub repository: https://github.com/hkallo/IODS-project
My GitHub course diary: https://hkallo.github.io/IODS-project/
# Date of submission
date()
## [1] "Sun Nov 26 17:03:52 2023"
How are you feeling right now? Feeling good, looking forward to learning new things in this course. R for Health Data Science is truly great, and already after reading the first sections I have learnt several new things that will make my scripts better and more concise. The book has very nice explanations for everything.
What do you expect to learn? I already have some experience with the R statistical software and the basics of statistical analysis. I expect to learn about open data sources and how to utilize them, as well as about GitHub, which is a completely new thing to me. It is also nice to get a recap of the basics of statistics and perhaps find new ways to execute things in R.
Where did you hear about the course? I saw an email.
Also reflect on your learning experiences with the R for Health Data Science book and the Exercise Set 1: How did it work as a “crash course” on modern R tools and using RStudio? As I already have some knowledge of R, it was pretty easy and straightforward. However, if I had no experience with the language, I don’t think I would learn too much, as it is detached from actual script writing and nothing really had to be done by yourself. The book seems like a very nice tool for checking how to execute things in R, though, and even better together with the assignments!
Which were your favorite topics? Which topics were most difficult? My favorite topics were the pipe (%>%) and several parts of the Summarizing data chapter. These helped me understand functions that I have already used but apparently without understanding them perfectly:) It will be easier to use them in the future. I have also tried R Markdown before, but it will be nice to learn more about it. I did not know about the cheat sheets in R; that was a great tip!
Some other comments on the book and our new approach of getting started with R Markdown etc.? I guess during upcoming weeks we get to do coding ourselves so no other comments:)
# Assignment submitted
date()
## [1] "Sun Nov 26 17:03:52 2023"
library(tidyverse); library(dplyr); library(ggplot2); library(GGally) #load libraries
data <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header = TRUE) #data upload
str(data) #structure of the object
dim(data) #dimensions of the data
Next we will create a new dataset for the analysis containing only the needed information: gender, age, attitude, the deep, strategic and surface level question scores, and points.
#save the variables ready in the original data frame to the new analysis data frame
learningAnalysis <- data %>%
select(gender, Age, Attitude, Points)
#find combination variables (deep, strategic and surface level questions) and save each on their own variable
deep <- data %>%
select(starts_with("D")) #NB! This selects too many columns; the next line keeps only the 12 deep question columns.
deep <- deep[,1:12]
surf <- data %>%
select(starts_with("SU"))
stra <- data %>%
select(starts_with("ST"))
#averaging the answers of selected questions and saving them to the analysis dataset
learningAnalysis$deep <- rowMeans(deep)
learningAnalysis$stra <- rowMeans(stra)
learningAnalysis$surf <- rowMeans(surf)
#scale the combination variable Attitude back to the 1-5 scale by dividing by 10, and delete the old variable
learningAnalysis$attitude <- learningAnalysis$Attitude / 10
learningAnalysis <- subset(learningAnalysis, select = -Attitude)
colnames(learningAnalysis) #check the column names and convert if needed
colnames(learningAnalysis)[2] <- "age"
colnames(learningAnalysis)[3] <- "points"
learningAnalysis <- filter(learningAnalysis, points>0) #include only the rows where points > 0
#reorder the columns:
learningAnalysis <- learningAnalysis %>%
select(gender, age, attitude, deep, stra, surf, points)
#last check
str(learningAnalysis)
head(learningAnalysis)
Now we have the data frame prepared for the subsequent analysis. Let’s save the file to the IODS project folder.
setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory to the IODS project folder
getwd() #check that it worked
write_csv(learningAnalysis, file= "learning2014.csv") #save file as csv
setwd("C:/Users/Henna/Desktop/IODS/IODS-project") #set working directory
learningAnalysis <- read.csv("learning2014.csv", header=TRUE) #read file into R
str(learningAnalysis) #structure of the data frame
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
head(learningAnalysis) #first rows of the data frame
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
summary(learningAnalysis) #summary of the variables
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
Data description:
Research question: Do students’ gender, age, attitude and question scores affect success in the course (points gained)? Learning approaches are measured with sets of questions on several levels (deep, strategic, surface), and another set of questions quantifies attitude on a 1-5 scale. Points represent the points the students gained from the course. The data also include information on the students’ age and gender.
Data frame structure: There are 166 observations of 7 variables in total: gender (character), age (integer), attitude (numeric), the deep, strategic and surface question scores (numeric), and points (integer).
The summary table above describes summaries of variables: min, max, mean, median and 1st and 3rd quartiles.
There are 110 women and 56 men in this study.
The students are aged from 17 up to 55.
Attitude scores range from 1.4 to 5.0, the average being 3.14.
The mean scores of the deep, strategic and surface questions are 3.68, 3.12 and 2.79, respectively.
The minimum points gained is 7, the maximum 33, and the average 22.72. 50% of all scores fall within the range 19.0-27.75 points.
To test this, we will perform regression analysis, which is a statistical method that describes the relationship between two or more variables.
Graphical overview of the data:
#Function for adding correlation panel
panel.cor <- function(x, y, ...){
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- round(cor(x, y), digits=2)
txt <- paste0("R = ", r)
cex.cor <- 0.8/strwidth(txt) #optional: text size scaled by correlation strength
text(0.5, 0.5, txt, cex = 1)} #use cex = cex.cor to apply the scaling
#Function for adding regression line
upper_panel_regression_line = function(x,y, ...){
points(x,y,...)
linear_regression = lm(y~x)
linear_regression_line = abline(linear_regression)}
my_cols <- c("#00AFBB", "#E7B800") #set colors
learningAnalysis$gender<-as.factor(learningAnalysis$gender) #convert gender to a factor so that colors can be mapped to it
#Visualization with a scatter plot + regression line matrix, add R values to the lower side of the matrix.
pairs(learningAnalysis[-1], col = my_cols[learningAnalysis$gender],
lower.panel = panel.cor , upper.panel = upper_panel_regression_line)
The scatter plot matrix shows the distribution of the observations (females turquoise, males yellow). The regression lines give some indication of whether there is an association between two variables. For instance, there seems to be a positive dependency between attitude and points: the line goes upwards and is steeper than any other line. Also, the R-value (correlation coefficient) for points vs. attitude is 0.44, suggesting a positive correlation between these variables. If the slope is (close to) horizontal and R is (close to) zero, there is no association between the variables; this is the case, for example, between the deep and points variables. Then again, if the R-value is negative and the slope of the regression line is negative (the line goes downhill), as between the surf and points variables, the association is negative: the higher the surface question score, the lower the points are likely to be. Thus, with this kind of modelling we can also predict success in the course based on the answers to the preliminary questions. A better explanation of R-values and their meaning comes after the next figure.
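The tie between the sign of the regression slope and the sign of R can be checked directly, since in simple regression the slope equals R × sd(y)/sd(x). A minimal sketch with made-up data (not the course data):

```r
# Toy illustration (made-up data, not the course data): in simple linear
# regression the slope has the same sign as the correlation, because
# slope = R * sd(y) / sd(x).
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)            # positively associated by construction
r <- cor(x, y)
slope <- coef(lm(y ~ x))[["x"]]
all.equal(slope, r * sd(y) / sd(x))  # TRUE: the two quantities coincide
```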
Let’s check whether running the analysis separately for males and females reveals differences in the associations between the variables.
# create a more advanced plot matrix with ggpairs()
ggpairs(learningAnalysis, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
Let’s go through the figure. Red = women, Blue = men.
There are about twice as many women as men in this study.
The boxplots reveal that the median age of women is a bit lower than that of men. Women’s median attitude score is also a bit lower, while their strategic and surface question scores are slightly higher.
The density plots reveal the same. They also show that the distributions of the attitude and surface question scores are quite different between males and females. The density plot of age reveals that the data are right-skewed, meaning that there are more young participants than older ones. Age, stra and surf have clearly unimodal distributions (only one peak); the other variables have 1-2 peaks, sometimes depending on gender. The scatter plots and histograms also show the distribution of values, but the clearest way to draw conclusions about the distributions is still the density plot.
As we are focused on studying the relationship between the dependent variable ‘points’ and the explanatory variables age, attitude, deep, stra and surf, let’s focus on the rightmost column of correlation coefficients (which measure the linear correlation between two sets of data). Mukaka (2012; https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/) presents a rule of thumb for interpreting different coefficient values. The strongest association is between attitude and points, and it is strong in both genders separately as well. Interestingly, age seems to have a stronger association with points in men than in women. This association is negative, meaning that older age is associated with lower scores. The next biggest impact seems to come from the stra and surf variables, but their coefficients are already quite low and not statistically significant.
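As a side note, the Mukaka rule of thumb can be wrapped into a small helper function (hypothetical, not from any package; thresholds per Mukaka, 2012):

```r
# Hypothetical helper (not from any package): interpret the absolute value
# of a correlation coefficient using the Mukaka (2012) rule of thumb.
interpret_r <- function(r) {
  a <- abs(r)
  if (a >= 0.9) "very high" else
  if (a >= 0.7) "high"      else
  if (a >= 0.5) "moderate"  else
  if (a >= 0.3) "low"       else "negligible"
}
interpret_r(0.44)   # attitude vs. points: "low" (but clearly nonzero)
interpret_r(-0.14)  # surf vs. points: "negligible"
```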
Multiple regression model
Now we select 3 explanatory variables to explain dependent variable “points”. This selection is based on the slopes of regression lines and R values in the figures above. The 3 highest absolute R values are selected: attitude (R=0.44), stra (R=0.15) and surf (R=-0.14) (genders not separated in the analysis).
#Multiple regression analysis
# create a regression model with multiple explanatory variables
my_model<- lm(points ~ attitude + stra + surf, data = learningAnalysis)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learningAnalysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The p-value of the F-statistic is 3.156e-08, which is highly significant. This means that at least one of the predictor variables (attitude, stra, surf) is significantly related to the outcome variable.
To identify which predictor variables are significant, let’s examine the coefficients table, which shows the estimates of the regression beta coefficients and the associated t-statistic p-values. Attitude is significantly associated with points, whereas stra and surf show no association, so we can remove these two variables from the model. The coefficient b for attitude is ~3.40, meaning that this is the average effect on the dependent variable (points) when the predictor (attitude) increases by one unit and all the other predictors stay unchanged.
my_model_attitude<- lm(points ~ attitude, data = learningAnalysis)
# print out a summary of the model
summary(my_model_attitude)
##
## Call:
## lm(formula = points ~ attitude, data = learningAnalysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
The p-value of the F-statistic is significant (4.119e-09), and the model tells us that when attitude increases by 1 unit, points increase by about 3.5 on average. The final equation is: points = 11.6 + 3.5 × attitude.
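For illustration, the fitted equation can be used directly for prediction; the helper below is hypothetical, not part of the analysis code, and simply plugs attitude values into the coefficients from the summary above:

```r
# Illustrative helper (not part of the analysis code): predict course points
# from the fitted equation points = 11.6372 + 3.5255 * attitude.
predict_points <- function(attitude) 11.6372 + 3.5255 * attitude
predict_points(c(2, 3.14, 5))  # ~18.7, ~22.7, ~29.3
```

Reassuringly, at the average attitude score (~3.14) the model predicts roughly the average points (~22.7).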
Out of curiosity, before proceeding to quality assessment, let’s check if age should be included in the model when analyzing only explanatory variables of points in men.
menstudents <- learningAnalysis %>%
filter(gender=="M")
my_model_men<- lm(points ~ attitude + age, data = menstudents)
summary(my_model_men) #summary of the model
##
## Call:
## lm(formula = points ~ attitude + age, data = menstudents)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6798 -3.2162 0.5482 2.8711 9.3838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.66207 4.56036 2.996 0.004156 **
## attitude 4.10182 1.10353 3.717 0.000487 ***
## age -0.16050 0.08465 -1.896 0.063418 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.285 on 53 degrees of freedom
## Multiple R-squared: 0.2538, Adjusted R-squared: 0.2256
## F-statistic: 9.013 on 2 and 53 DF, p-value: 0.0004272
Interestingly, this analysis indicates that age might be negatively associated with points in men: the older the man, the fewer the points. The p-value (0.06) is very close to statistical significance (p < 0.05). However, let’s not continue further with this subset but rather analyse men and women together.
Quality assessment
Finally, we should assess the quality of the model, based on the R-squared (R2) and the Residual Standard Error (RSE). R2 is sensitive to the number of variables included in the model, so the adjusted R2 corrects for the number of explanatory variables in the prediction model. The adjusted R2 is 0.1856, meaning that ~19% of the variance in points can be predicted by the attitude score. If we compare this adjusted R2 to the previous model with 3 predictor variables, the values are not very different, meaning that having 3 predictors did not improve the quality of the model.
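The adjusted R2 reported by summary() can be reproduced from the multiple R2, the sample size n, and the number of predictors p:

```r
# Reproducing the adjusted R-squared from the summary() output above:
# adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
r2 <- 0.1906  # multiple R-squared
n  <- 166     # number of observations
p  <- 1       # number of predictors (attitude only)
1 - (1 - r2) * (n - 1) / (n - p - 1)  # ~0.186, matching the reported 0.1856 up to rounding
```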
#error rate
summary(my_model_attitude)$sigma/mean(learningAnalysis$points)
## [1] 0.2341752
The RSE estimates the error of prediction: the lower the RSE, the more accurate the model. Here we calculated the error rate by dividing the RSE by the mean of the outcome variable; thus, an RSE of 5.32 corresponds to a ~23% error rate.
The last thing to do is to graphically explore the validity of our model assumptions with the Residuals vs Fitted values, Normal Q-Q, and Residuals vs Leverage plots. A residual is the difference between an observed value and the value that the model predicts for that observation.
# draw diagnostic plots using the plot() function
par(mfrow = c(2,2))
plot(my_model_attitude, which=c(1,2,5))
par(mfrow = c(1, 1)) #reset plotting matrix:
Conclusion: The final course points of the students are positively associated with the attitude scores based on the preliminary questions asked before the course: the higher the attitude score, the more points. Only one explanatory variable is included in the model, as the rest did not reach significance. The quality assessment suggested that our model is reasonably reliable.
We are exploring data from two questionnaires related to student performance in secondary education of two Portuguese schools. The data includes student grades, demographic, social and school related variables. Here we have combined data sets of Mathematics and Portuguese language subjects.
Here we study the relationships between alcohol consumption and selected variables.
Data source: http://www.archive.ics.uci.edu/dataset/320/student+performance
#import data
library(readr)
alc<-read_csv("C:/Users/Henna/Desktop/IODS/IODS-project/Data/student_performance_alcohol.csv")
alc<-as.data.frame(alc);
colnames(alc) #column names
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
str(alc) #structure of the dataset
## 'data.frame': 370 obs. of 35 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : num 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : num 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : num 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: num 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : num 2 2 2 3 2 2 2 2 2 2 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : num 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : num 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : num 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : num 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : num 1 1 3 1 2 2 1 1 1 1 ...
## $ health : num 3 3 3 5 5 5 3 1 1 5 ...
## $ failures : num 0 0 2 0 0 0 0 0 0 0 ...
## $ paid : chr "no" "no" "yes" "yes" ...
## $ absences : num 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : num 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : num 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : num 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
All the variables are listed above. The variables not used for joining the two datasets have been combined by averaging (including the grade variables). Alcohol use (‘alc_use’) is the average of workday and weekend alcohol consumption; if the average is higher than 2, alcohol consumption is considered ‘high use’. The rest of the variables are described on the website (link given above).
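A minimal sketch, on a toy data frame, of how the derived variables were presumably defined per the description above (Dalc and Walc are the original workday and weekend consumption columns):

```r
# Toy sketch of the derived variables: alc_use is the mean of workday (Dalc)
# and weekend (Walc) consumption, and high_use flags averages above 2.
toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2  # 1.0, 2.5, 4.5
toy$high_use <- toy$alc_use > 2            # FALSE, TRUE, TRUE
toy
```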
We have 370 observations and 35 variables in the data frame. There are character, numeric and logical variables.
Next we select 4 interesting variables to further study their relationship with alcohol consumption. Let’s visualize the variables:
library(tidyr); library(ggplot2); library(dplyr);
gather(alc) %>% ggplot(aes(value)) +
facet_wrap("key", scales = "free") +
geom_bar(fill="#00AFBB") +
ggtitle("Barplots of all variables")
Below are listed the 4 chosen factors, a brief description of each variable, and its expected relationship with alcohol consumption.
Next we will see whether numerical and graphical exploration of the data supports our hypotheses. We use density or frequency plots (depending on the variable), as well as cross-tabulations and plots for visualization.
ggplot(alc, aes(x=goout)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("GoOut distribution") +
theme_classic()
# create cross-tabulations and plots to visualize
library(sjPlot)
tab_xtab(var.row = alc$goout, var.col = alc$high_use, title = "Cross-Tab of GoOut & High alcohol consumption", show.row.prc = TRUE)
| goout | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| 1 | 19 (86.4 %) | 3 (13.6 %) | 22 (100 %) |
| 2 | 82 (84.5 %) | 15 (15.5 %) | 97 (100 %) |
| 3 | 97 (80.8 %) | 23 (19.2 %) | 120 (100 %) |
| 4 | 40 (51.3 %) | 38 (48.7 %) | 78 (100 %) |
| 5 | 21 (39.6 %) | 32 (60.4 %) | 53 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=55.574 · df=4 · Cramér’s V=0.388 · p=0.000
plot_xtab(alc$goout, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The density plot shows that most students have a ‘goout’ score of 2-4. The cross-tabulation shows that the higher the ‘goout’ score, the larger the proportion of observations with high_use = TRUE. This supports our hypothesis: a higher ‘goout’ value is linked with heavier alcohol consumption.
ggplot(alc, aes(x=absences)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Absences distribution") +
theme_classic()
tab_xtab(var.row = alc$absences, var.col = alc$high_use, title = "Cross-Tab of Absences & High alcohol consumption", show.row.prc = TRUE)
| absences | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| 0 | 50 (79.4 %) | 13 (20.6 %) | 63 (100 %) |
| 1 | 37 (74 %) | 13 (26 %) | 50 (100 %) |
| 2 | 40 (71.4 %) | 16 (28.6 %) | 56 (100 %) |
| 3 | 31 (81.6 %) | 7 (18.4 %) | 38 (100 %) |
| 4 | 24 (68.6 %) | 11 (31.4 %) | 35 (100 %) |
| 5 | 16 (72.7 %) | 6 (27.3 %) | 22 (100 %) |
| 6 | 16 (76.2 %) | 5 (23.8 %) | 21 (100 %) |
| 7 | 9 (75 %) | 3 (25 %) | 12 (100 %) |
| 8 | 14 (70 %) | 6 (30 %) | 20 (100 %) |
| 9 | 5 (45.5 %) | 6 (54.5 %) | 11 (100 %) |
| 10 | 5 (71.4 %) | 2 (28.6 %) | 7 (100 %) |
| 11 | 2 (40 %) | 3 (60 %) | 5 (100 %) |
| 12 | 3 (42.9 %) | 4 (57.1 %) | 7 (100 %) |
| 13 | 1 (50 %) | 1 (50 %) | 2 (100 %) |
| 14 | 1 (14.3 %) | 6 (85.7 %) | 7 (100 %) |
| 16 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 17 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 18 | 1 (50 %) | 1 (50 %) | 2 (100 %) |
| 19 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 20 | 2 (100 %) | 0 (0 %) | 2 (100 %) |
| 21 | 1 (50 %) | 1 (50 %) | 2 (100 %) |
| 26 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 27 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 29 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 44 | 0 (0 %) | 1 (100 %) | 1 (100 %) |
| 45 | 1 (100 %) | 0 (0 %) | 1 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=43.001 · df=25 · Cramér’s V=0.341 · Fisher’s p=0.008
plot_xtab(alc$absences, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The density plot shows that most students have an absence score near zero (the distribution is right-skewed). The cross-tabulation shows that the more absences there are, the larger the proportion of students with high alcohol consumption tends to be. This supports our hypothesis: a higher number of absences is linked with heavier alcohol consumption.
ggplot(alc, aes(x=failures)) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Failures distribution") +
theme_classic()
tab_xtab(var.row = alc$failures, var.col = alc$high_use, title = "Cross-Tab of failures & High alcohol consumption", show.row.prc = TRUE)
| failures | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| 0 | 238 (73.2 %) | 87 (26.8 %) | 325 (100 %) |
| 1 | 12 (50 %) | 12 (50 %) | 24 (100 %) |
| 2 | 8 (47.1 %) | 9 (52.9 %) | 17 (100 %) |
| 3 | 1 (25 %) | 3 (75 %) | 4 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=14.304 · df=3 · Cramér’s V=0.197 · Fisher’s p=0.002
plot_xtab(alc$failures, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The density plot of failures is also right-skewed, meaning that most students pass the class. The cross-tabulation shows that a higher number of failures is linked with a larger proportion of high alcohol consumption. This supports our hypothesis: more failures are linked with heavier alcohol consumption.
ggplot(alc, aes(x=romantic)) +
geom_bar(fill="#69b3a2", color="#e9ecef", alpha=0.8) +
ggtitle("Romantic relationship status frequencies") +
theme_classic()
tab_xtab(var.row = alc$romantic, var.col = alc$high_use, title = "Cross-Tab of Romantic relationship status & High alcohol consumption", show.row.prc = TRUE)
| romantic | high_use = FALSE | high_use = TRUE | Total |
|---|---|---|---|
| no | 173 (68.9 %) | 78 (31.1 %) | 251 (100 %) |
| yes | 86 (72.3 %) | 33 (27.7 %) | 119 (100 %) |
| Total | 259 (70 %) | 111 (30 %) | 370 (100 %) |

χ²=0.286 · df=1 · φ=0.034 · p=0.593
plot_xtab(alc$romantic, alc$high_use, margin = "row", bar.pos = "stack", coord.flip = TRUE)
The frequency plot shows that about 1/3 of the students are in a romantic relationship whereas ~2/3 are single. The cross-tabulation indicates that even though the percentage of high alcohol use is slightly larger among single students, the difference is most likely not statistically meaningful (p = 0.593). Thus, this result does not support our hypothesis that there might be an association between ‘romantic’ and ‘high_use’.
We are interested in whether alcohol consumption is associated with the chosen set of student characteristics. Binary logistic regression can tell us the probability of this. We choose binary logistic regression because the outcome variable, ‘high_use’, has two levels (TRUE/FALSE); the explanatory variables can be of other types as well.
#binary logistic regression
m <- glm(high_use ~ failures + absences + goout + romantic, data = alc, family = "binomial")
summary(m) # a summary of the model
##
## Call:
## glm(formula = high_use ~ failures + absences + goout + romantic,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0472 -0.7510 -0.5266 0.8886 2.3474
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.55115 0.44211 -8.032 9.57e-16 ***
## failures 0.56492 0.22385 2.524 0.011612 *
## absences 0.07762 0.02243 3.461 0.000538 ***
## goout 0.70634 0.11846 5.963 2.48e-09 ***
## romanticyes -0.35224 0.27737 -1.270 0.204111
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 382.41 on 365 degrees of freedom
## AIC: 392.41
##
## Number of Fisher Scoring iterations: 4
Let’s go through the output:
First we get the distribution of the deviance residuals.
Next come the coefficients, their standard errors, the z-statistics, and the associated p-values. Failures, absences and goout are statistically significant; the variable ‘romantic’ is not.
The logistic regression coefficients give the change in the log odds of the outcome (high_use) for a one-unit increase in the predictor variable:
- for every one-unit change in failures, the log odds of high_use = TRUE increase by 0.56
- for every one-unit change in absences, the log odds of high_use = TRUE increase by 0.08
- for every one-unit change in goout, the log odds of high_use = TRUE increase by 0.71
From the results we can construct the logistic regression equation (leaving out the statistically non-significant variable):
log(odds[high_use=yes]) = -3.55115 + 0.56492 * failures + 0.07762 * absences + 0.70634 * goout
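The equation returns log odds; plogis(), the inverse logit, converts them to a probability. A sketch with made-up predictor values (the helper function below is illustrative, not part of the analysis code):

```r
# Illustrative helper: the fitted equation returns log odds, and plogis()
# (the inverse logit) converts them to a probability. Input values made up.
log_odds <- function(failures, absences, goout) {
  -3.55115 + 0.56492 * failures + 0.07762 * absences + 0.70634 * goout
}
plogis(log_odds(failures = 0, absences = 4, goout = 5))  # ~0.57
```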
Next we will compute the odds ratios (OR) and confidence intervalss (CI):
coef(m) # the coefficients of the model
## (Intercept) failures absences goout romanticyes
## -3.55114961 0.56492346 0.07762078 0.70634191 -0.35223716
OR <- coef(m) %>% exp # compute odds ratios (OR)
CI <- confint(m) %>% exp # compute confidence intervals (CI)
cbind(OR, CI) # print out the odds ratios with their confidence intervals
## OR 2.5 % 97.5 %
## (Intercept) 0.02869164 0.01165358 0.06618211
## failures 1.75931312 1.14014676 2.75402869
## absences 1.08071276 1.03551949 1.13208308
## goout 2.02656433 1.61539789 2.57288191
## romanticyes 0.70311335 0.40369796 1.20112415
We can conclude the following:
- for a one-unit increase in failures, the odds of high alcohol consumption increase by a factor of 1.76 (the outcome is 76% more likely)
- for a one-unit increase in absences, the odds of high alcohol consumption increase by a factor of 1.08 (the outcome is 8% more likely)
- for a one-unit increase in goout, the odds of high alcohol consumption increase by a factor of 2.03 (the odds of the outcome double)
CI is used to estimate the precision of the OR. A large CI indicates a low level of precision of the OR, whereas a small CI indicates a higher precision of the OR.
These results support our hypotheses of the effects of failures, absences and goouts, but not about the effect of romantic relationship status.
Next, we will explore the predictive power of the model. We will include only the statistically significant variables: failures, absences & goout.
m_pred <- glm(high_use ~ failures + absences + goout, data = alc, family = "binomial")
probabilities <- predict(m_pred, type = "response") # predict the probability of high_use
alc <- mutate(alc, probability = probabilities) # add the predicted probabilities to 'alc'
alc <- mutate(alc, prediction = probability > 0.5) # use the probabilities to make a prediction of high_use
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, goout, high_use, probability, prediction) %>% tail(10)
## failures absences goout high_use probability prediction
## 361 0 3 3 FALSE 0.21435314 FALSE
## 362 1 0 2 FALSE 0.15791281 FALSE
## 363 1 7 3 TRUE 0.38937775 FALSE
## 364 0 1 3 FALSE 0.19036196 FALSE
## 365 0 6 3 FALSE 0.25431749 FALSE
## 366 1 2 2 FALSE 0.17871716 FALSE
## 367 0 2 4 FALSE 0.33847888 FALSE
## 368 0 3 1 FALSE 0.06266341 FALSE
## 369 0 4 5 TRUE 0.54534711 TRUE
## 370 0 2 1 TRUE 0.05843362 FALSE
table(high_use = alc$high_use, prediction = alc$prediction) # tabulate the target variable versus the predictions
## prediction
## high_use FALSE TRUE
## FALSE 237 22
## TRUE 66 45
ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ #plot of 'high_use' versus 'probability' in 'alc'
geom_point()+
ggtitle("Plotted confusion matrix results ")
The printout of the data frame shows the new columns: the predicted probabilities and the predictions of high_use.
A confusion matrix visualizes and summarizes the performance of a classification algorithm:
- true negatives: 237
- true positives: 45
- false positives (type I error): 22
- false negatives (type II error): 66
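From these counts we can compute common classification metrics; a quick sketch, not part of the course exercise:

```r
# Quick sketch: common classification metrics computed from the
# confusion matrix counts above (TN = 237, TP = 45, FP = 22, FN = 66).
tn <- 237; tp <- 45; fp <- 22; fn <- 66
(tp + tn) / (tp + tn + fp + fn)  # accuracy             ~0.76
tp / (tp + fn)                   # sensitivity (recall) ~0.41
tn / (tn + fp)                   # specificity          ~0.92
```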
Next, let’s compute the average number of incorrect predictions. The mean of incorrectly classified observations can be thought of as a penalty (loss) function for the classifier. Less penalty = good.
table(high_use = alc$high_use, prediction = alc$prediction) %>% # tabulate the target variable versus the predictions
prop.table() %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64054054 0.05945946 0.70000000
## TRUE 0.17837838 0.12162162 0.30000000
## Sum 0.81891892 0.18108108 1.00000000
loss_func <- function(class, prob) { # define a loss function (mean prediction error)
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability) #compute the average number of wrong predictions in the (training) data
## [1] 0.2378378
This analysis revealed that the proportion of wrong predictions in the training data is ~24%.
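As a rough benchmark (a hedged sketch, not part of the original analysis), we can feed a constant prediction into the same loss function. A trivial classifier that always predicts FALSE misclassifies exactly the high-use students, i.e. about 30% of the observations (the TRUE row sum in the proportion table above):

```r
# baseline: predict probability 0 (i.e. FALSE) for everyone;
# the loss then equals the share of high_use = TRUE observations (~0.30 here)
loss_func(class = alc$high_use, prob = 0)
```

The fitted model's loss of ~0.24 is therefore only a modest improvement over always guessing the majority class.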
Now we continue to cross-validation of the model (first leave-one-out, then 10-fold).
# K-fold cross-validation
library(boot)
# leave-one-out cross-validation (K = number of observations)
cv_train <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = nrow(alc))
# 10-fold cross-validation
cv_test <- cv.glm(data = alc, cost = loss_func, glmfit = m_pred, K = 10)
# average number of wrong predictions
cv_train$delta[1]
## [1] 0.2405405
cv_test$delta[1]
## [1] 0.2459459
Comparing the average prediction errors from leave-one-out and 10-fold cross-validation, we see that the error is about the same (~0.24–0.25). The error is slightly smaller than in the exercise set, meaning that a model with failures, absences and goout predicts high alcohol use a bit better than the model used in the exercise set.
Data source: MASS package in R: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
Literature: Part IV of “MABS4IODS” (chapters 12, 17 & 18)
Important concepts:
library(MASS); library(tidyr); library(corrplot); library(dplyr); library(ggplot2) #load libraries
rm(list=ls())
data("Boston") #load the data
Let’s take a look at the Boston data:
str(Boston) #structure of the dataset
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
With this data set we can explore connections between economic, environmental, societal and cultural features.
The descriptions of the variables are listed here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html.
Let’s explore the summary statistics of the multivariate Boston data:
– each of the variables separately
– relationships between the variables (correlations)
summary(Boston) #summary of each variable separately
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
A scatterplot of each variable pair can be used as a first indicative visualization of associations between the variables. In addition, we print a correlation matrix visualization, which shows how strong the association between each variable pair is, and whether it is positive or negative.
pairs(Boston,
col = "cornflowerblue",
main = "Scatterplots for each variable-combination of Boston data frame")
cor_matrix <- cor(Boston) %>% #correlation matrix
round(digits=2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6) ##visualize the correlations
The crime rate seems to be slightly positively associated with the proportion of owner-occupied units built prior to 1940 and with lower status of the population. Also, as accessibility to radial highways improves and the property-tax rate increases, the per-capita crime rate increases.
Not that surprisingly, there is an association between industry and nitrogen oxide levels. Moreover, a higher nitrogen oxide concentration seems to be associated with a higher proportion of owner-occupied units built prior to 1940 and with lower status of the population. The scatterplots indicate that the air pollution and industry variables are in turn negatively associated with the distance to the Boston employment centres and with the median value of owner-occupied homes.
A lower number of rooms per dwelling seems to be linked with lower status of the population. As expected, the average number of rooms is positively associated with the median value of owner-occupied homes. Lower status of the population is clearly linked with a reduced median value of owner-occupied homes.
Moreover, there is a negative association between the proportion of owner-occupied units built prior to 1940 and the distance to employment centres, and a positive correlation between accessibility to radial highways and the full-value property-tax rate per $10,000.
The variables are on very different scales. We will make them more comparable by standardizing the dataset.
boston_scaled <- scale(Boston) # center and standardize variables
summary(boston_scaled) # summaries of the scaled variables
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
When scaling the data, we subtract the column mean from each column and divide the difference by the standard deviation. This is why the mean of every variable is 0. After scaling (centering and standardizing) we can better compare the variables with each other. This website briefly discusses the why and when of centering and scaling variables: https://www.goldsteinepi.com/blog/thewhyandwhenofcenteringcontinuouspredictorsinregressionmodeling/index.html
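To see concretely what scale() does, here is a minimal sketch (not part of the original analysis) verifying that it matches subtracting the column mean and dividing by the column standard deviation, using crim as an example:

```r
library(MASS)
data("Boston")
# manual standardization of one column, compared against scale()
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(manual, as.numeric(scale(Boston)[, "crim"]))  # TRUE
```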
“A further question that is often of interest for grouped multivariate data is whether or not it is possible to use the measurements made to construct a classification rule derived from the original observations (the training set) that will allow new individuals having the same set of measurements (the test sample).” -Part IV of “MABS4IODS” -> discriminant function analysis (chapter 18)
Now we will perform linear discriminant analysis (LDA): the idea is to find a linear combination of features that characterizes or separates two or more classes of objects or events. When we want to use a statistical method to predict something, it is important to have data for testing how well the predictions fit. Splitting the original data into train and test sets allows us to check how well our model works.
Our target variable is the crime rate per capita by town; the rest of the variables are predictors. Our interest lies in deriving a classification rule that uses measurements of the predictor variables to identify an observation’s category of the crim variable.
We start by converting the crim variable to categorical, dividing the data into four categories: low, med_low, med_high and high crime rate per capita.
class(boston_scaled) # class of the boston_scaled object
## [1] "matrix" "array"
boston_scaled<-as.data.frame(boston_scaled) # change the object to data frame
#Create a factor variable from numerical: binning by quantiles; variable `crim` (per capita crime rate by town)
summary(boston_scaled$crim) #summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
bins<-quantile(boston_scaled$crim) #save quantiles, bin limits, in 'bins'
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Next we’ll divide the data into train (80% of the data) and test sets.
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows (note: without set.seed() before sampling, the split varies between runs)
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data; save for later to calculate how well the model performed in prediction
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
And finally perform the analysis and plot the results:
set.seed(123)
lda.fit <- lda(crime~., data = train)
lda.fit # print the lda.fit object
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2400990 0.2475248 0.2599010
##
## Group means:
## zn indus chas nox rm age
## low 1.02691124 -0.8925703 -0.079333958 -0.9001291 0.46186273 -0.9227692
## med_low -0.06288915 -0.2401781 0.011791568 -0.5381983 -0.14024936 -0.2851063
## med_high -0.38261997 0.1663357 0.278864965 0.3712962 0.09714482 0.4430410
## high -0.48724019 1.0170492 -0.009855719 1.0363160 -0.35876221 0.8131380
## dis rad tax ptratio black lstat
## low 0.9064412 -0.6947544 -0.7434325 -0.4223465 0.37795855 -0.767231531
## med_low 0.3383899 -0.5473481 -0.4451896 -0.0294608 0.34690459 -0.109555364
## med_high -0.3614090 -0.4524279 -0.3438243 -0.2598373 0.08615784 0.008337242
## high -0.8556434 1.6388211 1.5145512 0.7815834 -0.81439967 0.924683601
## medv
## low 0.52419927
## med_low -0.01118933
## med_high 0.16409444
## high -0.68887591
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.11190869 0.70503617 -0.89220509
## indus 0.05692257 -0.03176740 0.30292827
## chas -0.10205539 -0.07611978 0.06517590
## nox 0.36947641 -0.71672595 -1.39968094
## rm -0.15755420 -0.03806161 -0.18356450
## age 0.19807599 -0.37258245 -0.14949317
## dis -0.03159638 -0.17851946 0.08271081
## rad 3.68776285 1.02341687 -0.17504321
## tax 0.03714716 -0.15864637 0.71011407
## ptratio 0.12597155 -0.01172305 -0.25822042
## black -0.10478341 0.05472561 0.20858029
## lstat 0.19629225 -0.19460411 0.29784999
## medv 0.20175163 -0.35152917 -0.17941522
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9597 0.0312 0.0091
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.2)
Next, we use predict() to classify the LDA-transformed testing data. Finally, we calculate the accuracy of the LDA model by comparing the predicted classes with the true classes.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 14 11 0 0
## med_low 4 21 4 0
## med_high 0 6 17 3
## high 0 0 0 22
Let’s calculate the accuracy of the prediction: (14+21+17+22)/102 * 100% ≈ 73% of the predictions are correct (~27% of the observations are misclassified).
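Instead of typing the counts by hand, the accuracy can also be computed directly from the cross-tabulation. A small sketch (the exact numbers vary with the random train/test split):

```r
# accuracy = share of diagonal (correctly predicted) cells in the cross table
cross <- table(correct = correct_classes, predicted = lda.pred$class)
sum(diag(cross)) / sum(cross)      # prediction accuracy, ~0.73 with the split shown above
1 - sum(diag(cross)) / sum(cross)  # misclassification rate
```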
Cluster analysis is a generic term for a wide range of numerical methods with the common goal of uncovering or discovering groups or clusters of observations that are homogeneous and separated from other groups. Clusters are identified by assessing the relative distances between points: clustering means that some points (or observations) of the data are, in some sense, closer to each other than to other points.
Classification starts with calculating an inter-individual distance or similarity matrix. There are many ways to calculate distances or similarities between pairs of individuals; here we use the Euclidean distance. After calculating the distances, we proceed to run the k-means algorithm.
K-means clustering is an unsupervised method that assigns observations to groups (clusters) based on the similarity of the objects.
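One iteration of k-means alternates two steps: assign each point to its nearest cluster center, then recompute each center as the mean of its cluster. A toy one-dimensional illustration (not the analysis code):

```r
# one assignment + update step of k-means on a toy 1-D data set
x <- c(1, 2, 8, 9)   # toy data
centers <- c(1, 9)   # current cluster centers
# assignment step: each point goes to its nearest center
assign <- apply(abs(outer(x, centers, "-")), 1, which.min)
# update step: centers become the cluster means
new_centers <- tapply(x, assign, mean)
assign        # 1 1 2 2
new_centers   # 1.5 8.5
```

The kmeans() function below repeats these two steps until the assignments no longer change.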
data("Boston") #load the data again
boston_scaled_2 <- scale(Boston) # scaling variables
boston_scaled_2<-as.data.frame(boston_scaled_2) # change the object to data frame
dist_eu <- dist(boston_scaled_2) # euclidean distance matrix
summary(dist_eu) # look at the summary of the distances
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled_2, centers = 3) #centers = number of clusters
# plot the Boston dataset with clusters
pairs(boston_scaled_2, col = km$cluster)
The summary table of Euclidean distances shows the minimum, 1st quartile, median, mean, 3rd quartile and maximum of the pairwise distances between observations.
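As a quick illustration of what dist() computes (a sketch, not part of the original analysis), the Euclidean distance between the first two scaled observations can be reproduced by hand:

```r
# Euclidean distance between observations 1 and 2, computed by hand,
# matches the corresponding entry of the dist() matrix
x1 <- as.numeric(boston_scaled_2[1, ])
x2 <- as.numeric(boston_scaled_2[2, ])
sqrt(sum((x1 - x2)^2))     # same value as ...
as.matrix(dist_eu)[1, 2]   # ... the (1, 2) entry of dist_eu
```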
Let’s determine the optimal number of clusters. When plotting the number of clusters against the total within-cluster sum of squares (WCSS), the optimal number of clusters is where the total WCSS drops radically.
#K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function `set.seed()` can be used to deal with that.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled_2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The plot above represents the variance within the clusters. It decreases as k increases, but a bend can be seen at k = 6. This bend indicates that additional clusters beyond the sixth add little value. In the next section, we’ll classify the observations into 6 clusters. (more info: https://www.datanovia.com/en/lessons/k-means-clustering-in-r-algorith-and-practical-examples/)
km <- kmeans(boston_scaled_2, centers = 6) # k-means clustering
print(km)
## K-means clustering with 6 clusters of sizes 49, 47, 34, 78, 124, 174
##
## Cluster means:
## crim zn indus chas nox rm age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554 1.5399833 0.07933756
## 2 -0.2834643 -0.4872402 1.5965043 -0.2723291 1.0453673 -0.6245590 0.94375129
## 3 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.37213224
## 4 -0.4126954 1.9952361 -1.1032525 -0.2218534 -1.1552982 0.6080657 -1.40363754
## 5 1.1156495 -0.4872402 1.0149946 -0.2723291 0.9916473 -0.4276828 0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
## dis rad tax ptratio black lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738 0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809 0.03772813 -0.2084479 -0.08717824 0.7185475
## 3 -0.4033382 0.001081444 -0.09756330 -0.3924585 0.17154271 -0.1643525
## 4 1.5772490 -0.627024335 -0.57588304 -0.7161406 0.35335146 -0.9028481
## 5 -0.8170870 1.659602895 1.52941294 0.8057784 -0.81154619 0.9129958
## 6 0.2769540 -0.587168164 -0.59021294 0.1702603 0.30993538 -0.1365045
## medv
## 1 1.6147330
## 2 -0.5152915
## 3 0.5733409
## 4 0.6621944
## 5 -0.7713403
## 6 -0.1747229
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 6 1 1 1 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 4 6 6 6 6 6 6 6 6 6 6 6 4 4 4 4 4 4 4 6
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 6 6 6 4 4 4 4 6 6 6 6 6 6 6 6 6 6 6 6 6
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 4 6 6 6 6 6 6 6 1 1 6 6 6 6 6 6 6 1 1 1
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 2 2 3 2 2 2 2 2 2 2 2 2 3 2 3 3 2 1 2 2
## 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
## 3 1 3 3 2 2 1 2 2 2 2 2 6 6 6 1 6 6 1 1
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
## 1 1 1 1 6 6 1 4 4 4 4 4 4 4 4 4 4 4 4 4
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
## 4 4 4 4 4 6 6 6 3 3 3 3 3 6 6 6 3 1 3 3
## 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## 3 3 3 1 1 1 1 1 1 1 6 1 1 1 3 6 3 1 4 4
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260
## 4 6 4 4 6 6 4 6 6 4 4 4 4 4 4 4 4 1 1 1
## 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280
## 1 1 1 1 1 6 1 1 1 3 6 6 6 3 3 4 3 3 4 1
## 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300
## 1 1 3 4 4 4 4 4 4 4 4 4 4 6 6 6 6 6 4 4
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320
## 4 4 4 4 1 6 1 1 6 6 6 6 6 6 6 6 6 6 6 6
## 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
## 6 6 6 6 6 6 6 6 6 6 6 4 4 6 6 6 6 6 6 6
## 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360
## 6 4 6 4 4 6 6 4 4 4 4 4 4 4 4 4 3 3 3 5
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380
## 5 5 5 3 3 5 5 5 5 3 3 5 3 5 5 5 5 5 5 5
## 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480
## 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500
## 5 5 5 5 5 5 5 5 2 2 2 2 2 6 6 6 6 6 6 6
## 501 502 503 504 505 506
## 6 6 6 6 6 6
##
## Within cluster sum of squares by cluster:
## [1] 209.5683 263.8282 340.7321 363.4559 984.9444 590.5735
## (between_SS / total_SS = 61.1 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Let’s visualize the clusters with the pairs() function:
pairs(boston_scaled_2, col = km$cluster)
data("Boston")
boston_scaled_bonus <- scale(Boston) # scaling
boston_scaled_bonus<-as.data.frame(boston_scaled_bonus) # change the object to data frame
#clusters: km$cluster
#LDA
set.seed(123)
lda.fit_bonus <- lda(km$cluster~., data = boston_scaled_bonus)
lda.fit_bonus # print the lda.fit object
## Call:
## lda(km$cluster ~ ., data = boston_scaled_bonus)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6
## 0.09683794 0.09288538 0.06719368 0.15415020 0.24505929 0.34387352
##
## Group means:
## crim zn indus chas nox rm age
## 1 -0.3774003 -0.1398479 -0.8703728 -0.2723291 -0.2390554 1.5399833 0.07933756
## 2 -0.2834643 -0.4872402 1.5965043 -0.2723291 1.0453673 -0.6245590 0.94375129
## 3 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.37213224
## 4 -0.4126954 1.9952361 -1.1032525 -0.2218534 -1.1552982 0.6080657 -1.40363754
## 5 1.1156495 -0.4872402 1.0149946 -0.2723291 0.9916473 -0.4276828 0.75159525
## 6 -0.3884148 -0.3253420 -0.4696145 -0.2723291 -0.4787024 -0.2866326 -0.25638178
## dis rad tax ptratio black lstat
## 1 -0.2868574 -0.520140581 -0.82627237 -1.0314738 0.35523433 -0.9636997
## 2 -0.8963217 -0.622669809 0.03772813 -0.2084479 -0.08717824 0.7185475
## 3 -0.4033382 0.001081444 -0.09756330 -0.3924585 0.17154271 -0.1643525
## 4 1.5772490 -0.627024335 -0.57588304 -0.7161406 0.35335146 -0.9028481
## 5 -0.8170870 1.659602895 1.52941294 0.8057784 -0.81154619 0.9129958
## 6 0.2769540 -0.587168164 -0.59021294 0.1702603 0.30993538 -0.1365045
## medv
## 1 1.6147330
## 2 -0.5152915
## 3 0.5733409
## 4 0.6621944
## 5 -0.7713403
## 6 -0.1747229
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim -0.0305347753 -0.08591377 -0.108258758 0.01370669 0.12793637
## zn -0.2349745638 -0.08924451 -0.895982678 -1.47547778 0.34969495
## indus 0.1256678211 -0.64511317 1.388877801 -1.77654417 0.39972164
## chas 5.8325152872 0.24741612 -0.311207345 -0.07443844 -0.27828298
## nox -0.0025874888 0.27316108 0.279998181 -0.33251185 0.43887081
## rm -0.0806099076 -0.01151686 -0.148388969 0.12537565 0.56954956
## age 0.0721570288 -0.01868907 0.362844031 0.22642315 0.24959116
## dis 0.0698432704 0.27583463 -0.218044854 -0.54697937 -0.14293536
## rad 0.2608126124 -3.31540916 -1.507829403 1.14007512 0.04287448
## tax -0.0049952119 0.23295413 -0.381672532 -0.48815038 -0.21073890
## ptratio -0.0007726087 0.08817263 0.041915486 -0.04862072 -0.32527265
## black 0.0078197790 0.09331354 -0.003179723 0.02328372 -0.03306096
## lstat -0.1245078745 -0.13037037 -0.008198020 -0.03449407 0.25315537
## medv -0.1868981610 0.34078428 0.026517982 0.20032076 0.88234880
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.6255 0.2484 0.0727 0.0387 0.0147
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
graphics::arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes_bonus <- as.numeric(km$cluster)
# plot the lda results
plot(lda.fit_bonus, dimen = 2, col=classes_bonus, pch = classes_bonus)
lda.arrows(lda.fit_bonus, myscale = 1.2)
The most influential linear separators for the clusters are chas, rad & indus.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plotly::plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)
The 3D plot helps to see the separation of classes that overlap when projected onto only two axes.
###FIN###